Abstract

This paper develops an integrated system for clothing co-parsing, which
jointly parses a set of clothing images (unsegmented but annotated with
tags) into semantic configurations. We propose a data-driven framework
consisting of two phases of inference. The first phase, referred to as
"image co-segmentation", iteratively extracts consistent regions on images
and jointly refines the regions over all images by employing the
exemplar-SVM (E-SVM) technique [23]. In the second phase (i.e., "region
co-labeling"), we construct a multi-image graphical model by taking the
segmented regions as vertices, and incorporate several contexts of clothing
configuration (e.g., item location and mutual interactions). The joint label
assignment can be solved using the efficient Graph Cuts algorithm. In
addition to evaluating our framework on the Fashionista dataset [30], we
construct a dataset called CCP, consisting of 2098 high-resolution street
fashion photos, to demonstrate the performance of our system. We achieve
90.29% / 88.23% segmentation accuracy and 65.52% / 63.89% recognition rate
on the Fashionista and the CCP datasets, respectively, which are superior to
state-of-the-art methods.


1. Introduction 

Clothing recognition and retrieval have huge potential in internet-based
e-commerce, as the revenue of online clothing sales keeps increasing rapidly
every year. In computer vision, several interesting works [30, 5, 17, 16]
have been proposed on this task and have shown promising results.

* Corresponding author is Liang Lin. This work was supported by the Hi-Tech
Research and Development Program of China (no.2013AA013801), Guangdong
Natural Science Foundation (no.S2013050014548), Program of Guangzhou
Zhujiang Star of Science and Technology (no.2013J2200067), Special Project
on Integration of Industry, Education and Research of Guangdong
(no.2012B091100148), and Fundamental Research Funds for the Central
Universities (no.l31gjc26).


On the one hand, pixelwise labeling of clothing items within images is one
of the key resources for the above research, but it is often expensive and
inefficient to produce. On the other hand, it is feasible to acquire
image-level clothing tags from rich online user data. Therefore, an
interesting problem arises, which is the focus of this paper: how to jointly
segment the clothing images into regions of clothes and simultaneously
transfer semantic tags at the image level to these regions.

The key contribution of this paper is an engineered and applicable system¹
to jointly parse a batch of clothing images and produce accurate pixelwise
annotation of clothing items. We consider the following challenges in
building such a system:

• The appearances of clothes and garment items are often diverse, with
different styles and textures, compared with other common objects. It is
usually hard to segment and recognize clothes via only bottom-up image
features.

• The variations of human poses and self-occlusions are non-trivial issues
for clothing recognition, although the clothing images can be in clear
resolution and nearly frontal view.

• The number of fine-grained clothes categories is very large, e.g., more
than 50 categories in the Fashionista dataset [30]; the categories are
relatively fewer in existing co-segmentation systems [9, 7].

To address the above issues, we develop a system consisting of two
sequential phases of inference over a set of clothing images, i.e., image
co-segmentation for extracting distinguishable clothes regions, and region
co-labeling for recognizing various garment items, as illustrated in
Figure 1. Furthermore, we exploit contexts of clothing configuration, e.g.,
spatial locations and mutual relations of clothes items, inspired by the
successes of object/scene context modeling [14, 19, 12].


¹ http://vision.sysu.edu.cn/projects/clothing-co-parsing/

[Figure 1 panels: (a) Phase I: Co-Segmentation; (b) Phase II: Co-Labeling]


Figure 1. Illustration of the proposed clothing co-parsing framework, which
consists of two sequential phases of optimization: (a) clothing
co-segmentation for extracting coherent clothes regions, and (b) region
co-labeling for recognizing various clothes garments. Specifically, clothing
co-segmentation iterates with three steps: (a1) grouping superpixels into
regions, (a2) selecting confident foreground regions to train E-SVM
classifiers, and (a3) propagating segmentations by applying E-SVM templates
over all images. Given the segmented regions, clothing co-labeling is
achieved based on a multi-image graphical model, as illustrated in (b).


In the phase of image co-segmentation, the algorithm iteratively refines the
regions grouped over all images by employing the exemplar-SVM (E-SVM)
technique [23]. First, we extract superpixels and group them into regions
for each image, where most regions are often cluttered and meaningless due
to the diversity of clothing appearances and human variations, as shown in
Figure 1 (a1). Nevertheless, some coherent regions (in Figure 1 (a2)) that
satisfy certain criteria (e.g., size and location) can still be selected.
Then we train a number of E-SVM classifiers for the selected regions using
the HOG feature, i.e., one classifier for one region, and produce a set of
region-based detectors, as shown in Figure 1 (a3), which are applied as
top-down templates to localize similar regions over all images. In this way,
segmentations are refined jointly, as more coherent regions are generated by
the trained E-SVM classifiers. This process is inspired by the observation
that clothing items of the same fine-grained category often share similar
patterns (i.e., shapes and structures). In the literature, Kuettel et
al. [10] also proposed to propagate segmentations through HOG-based
matching.

Given the segmented regions of all images, it is very difficult to recognize
them by only adopting supervised learning, due to the large number of
fine-grained categories and the large intra-class variance. Instead, we
perform the second phase of co-labeling in a data-driven manner. We
construct a multi-image graphical model by taking the regions as vertices of
the graph, inspired by [27]. In our model, we link adjacent regions within
each image as well as regions across different images that share similar
appearance and latent semantic tags. Thus we can borrow statistical strength
from similar regions in different images and assign labels jointly, as
Figure 1 (b) illustrates. The optimization of co-labeling is solved by the
efficient Graph Cuts algorithm [4], which incorporates several constraints
defined upon the clothing contexts.

Moreover, a new database with ground truth is proposed for evaluating
clothing co-parsing. It includes more realistic and general challenges,
e.g., disordered backgrounds and multiform human poses, compared with the
existing clothes datasets [3, 6, 30]. We demonstrate the promising
performance and practical potential of our system in the experiments.

1.1. Related Work 

In the literature, existing efforts on clothing/human segmentation and
recognition mainly focused on constructing expressive models to address
various clothing styles and appearances [6, 8, 2, 28, 21, 30, 22]. One
classic work [6] proposed a composite And-Or graph template for modeling and
parsing clothing configurations. Later works studied blocking models to
segment clothes for highly occluded group images [28], or deformable spatial
prior modeling for improving the performance of clothing segmentation [8].
Recent approaches incorporated a shape-based human model [2], or pose
estimation and supervised region labeling [30], and achieved impressive
results. Despite these acknowledged successes, these works have not yet been
extended to the problem of clothing co-parsing, and they often require a
heavy labeling workload.

Clothing co-parsing is also highly related to image/object co-labeling,
where a batch of input images containing similar objects is processed
jointly [18, 20, 13]. For example, unsupervised shape-guided approaches were
adopted in [11] to achieve single object category co-labeling. Winn et
al. [29] incorporated automatic image segmentation and a spatially coherent
latent topic model to obtain unsupervised multi-class image labeling. These
methods, however, solved the problem in an unsupervised manner, and might be
intractable under circumstances with large numbers of categories and diverse
appearances. To deal with more complex scenarios, some recent works focused
on supervised label propagation, utilizing pixelwise label maps in the
training set and propagating labels to unseen images. The pioneering work of
Liu et al. [18] proposed to propagate labels over scene images using a
bi-layer sparse coding formulation. Similar ideas were also explored in
[15]. These methods, however, are often limited by expensive annotations. In
addition, they extracted image correspondences upon the pixels (or
superpixels), which are not discriminative for the clothing parsing problem.

The rest of this paper is organized as follows. We first introduce the
probabilistic formulation of our framework in Section 2, and then discuss
the implementation of the two phases in Section 3. The experiments and
comparisons are presented in Section 4, and Section 5 concludes the paper.

2. Probabilistic Formulation 

We formulate the task of clothing co-parsing as a probabilistic model. Let
$\mathbf{I} = \{I_i\}_{i=1}^{N}$ denote a set of clothing images with tags
$\{T_i\}_{i=1}^{N}$. Each image $I$ is represented by a set of superpixels
$I = \{s_j\}_{j=1}^{M}$, which will be further grouped into several coherent
regions under the guidance of segmentation propagation. Each image $I$ is
associated with four additional variables:

a) the regions $\{r_k\}_{k=1}^{K}$, each of which consists of a set of
superpixels;

b) the garment label for each region: $\ell_k \in T$, $k = 1, \ldots, K$;

c) the E-SVM weights $w_k$ trained for each selected region;

d) the segmentation propagations $C = (x, y, m)$, where $(x, y)$ is the
location and $m$ is the segmentation mask of an E-SVM, indicating that the
segmentation mask $m$ can be propagated to the position $(x, y)$ of $I$, as
illustrated in Figure 1 (a).

Let $R = \{R_i = \{r_{ik}\}\}$, $L = \{L_i = \{\ell_{ik}\}\}$,
$W = \{W_i = \{w_{ik}\}\}$, and $C = \{C_i\}$. We optimize the parameters by
maximizing the following posterior probability:

$$\{L^*, R^*, W^*, C^*\} = \arg\max P(L, R, W, C \mid \mathbf{I}), \tag{1}$$

which can be factorized as

$$P(L, R, W, C \mid \mathbf{I}) \propto \underbrace{P(L \mid R, C)}_{\text{co-labeling}} \times \underbrace{\prod_{i=1}^{N} P(R_i \mid C_i, I_i)\, P(W_i \mid R_i)\, P(C_i \mid W, I_i)}_{\text{co-segmentation}}. \tag{2}$$

The optimization of Eq. (2) includes two phases: (I) clothing image
co-segmentation and (II) co-labeling.

In phase (I), we obtain the optimal regions by maximizing $P(R \mid C, I)$
in Eq. (2). We introduce the superpixel grouping indicator
$o_j \in \{1, \ldots, K\}$, which indicates to which of the $K$ regions the
superpixel $s_j$ belongs. Then each region can be denoted as a set of
superpixels, $r_k = \{s_j \mid o_j = k\}$. Given the current segmentation
propagation $C$, $P(R \mid C, I)$ can be defined as

$$P(R \mid C, I) = \prod_{k} P(r_k \mid C, I) \propto \prod_{j} P(o_j \mid C, I) \propto \prod_{j=1}^{M} P(o_j, s_j) \prod_{mn} P(o_m, o_n, s_m, s_n \mid C), \tag{3}$$

where the unary potential $P(o_j, s_j) \propto \exp\{-d(s_j, o_j)\}$
indicates the probability that superpixel $s_j$ belongs to a region, and
$d(s_j, o_j)$ evaluates the spatial distance between $s_j$ and its
corresponding region. $P(o_m, o_n, s_m, s_n \mid C)$ is the pairwise
potential function, which encourages smoothness between neighboring
superpixels.

After grouping superpixels into regions, we then select several coherent
regions to train an ensemble of E-SVMs, by maximizing $P(W \mid R)$ defined
as follows,

$$P(W \mid R) = \prod_{k} P(w_k \mid r_k) \propto \prod_{k} \exp\{-E(w_k, r_k) \cdot \phi(r_k)\}, \tag{4}$$

where $\phi(r_k)$ is an indicator of whether $r_k$ has been chosen for
training an E-SVM, and $E(w_k, r_k)$ is the convex energy function of the
E-SVM.

Finally, we define $P(C_i \mid W, I_i)$ in Eq. (2) based on the responses of
the E-SVM classifiers. This probability is maximized by selecting the top k
detections of each E-SVM classifier, found by the sliding window scheme, as
the segmentation propagations.
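The top-k selection described above can be sketched in a few lines of Python. This is only an illustration: the tuple layout `(score, x, y)` and the function name `top_k_detections` are assumptions made for the example, not part of the described system.

```python
import heapq

def top_k_detections(responses, k):
    """Return the k highest-scoring sliding-window detections.

    `responses` is an iterable of (score, x, y) tuples; this layout is an
    assumption made for the sketch.
    """
    return heapq.nlargest(k, responses, key=lambda det: det[0])

# Hypothetical responses of one E-SVM at four window positions (x, y).
responses = [(0.12, 0, 0), (0.91, 8, 0), (0.47, 0, 8), (0.88, 8, 8)]
propagations = top_k_detections(responses, k=2)
print(propagations)
```

Each retained detection would then carry the exemplar's segmentation mask to the detected position.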

In phase (II), we assign a garment tag to each region by modeling the
problem as a multi-image graphical model, Eq. (5), which combines three
kinds of terms: $P(\ell_{ik}, r_{ik})$ represents the singleton potential of
assigning label $\ell_{ik}$ to region $r_{ik}$;
$P(\ell_m, \ell_n, r_m, r_n)$ is the interior affinity model capturing
compatibility among regions within one image; and
$Q(\ell_u, \ell_v, r_u, r_v \mid C)$ is the exterior affinity model for
regions belonging to different images, in which $r_u$ and $r_v$ are
connected under the segmentation propagation $C$. Details are discussed in
Section 3.2.

3. Clothing Co-Parsing 

In this section, we describe the two phases of clothing co-parsing,
including their implementation details. The overall procedure is outlined in
Algorithm 1.

3.1. Unsupervised Image Co-Segmentation 

The optimization in the co-segmentation phase estimates one variable while
keeping the others fixed, e.g., estimating R with W and C fixed. Thus the
first phase iterates between three steps:

i. Superpixel Grouping: The MRF model defined in Eq. (3) is the standard
pipeline for superpixel grouping. However, the number of regions needs to be
specified in advance, which is not a reasonable assumption for our problem.

To automatically determine the number of regions, we replace the superpixel
indicator $o_j$ by a set of binary variables $o^e$ defined on the edges
between neighboring superpixels. Let $e$ denote an edge; $o^e = 1$ if the
two superpixels $s_1^e$ and $s_2^e$ connected by $e$ belong to the same
region, otherwise $o^e = 0$. We also introduce a binary variable $o^c$, with
$o^c = 1$ indicating that all the superpixels covered by the mask of the
segmentation propagation $c$ belong to the same region, otherwise $o^c = 0$.
Then maximizing Eq. (3) is equivalent to the following linear programming
problem,

$$\arg\min_{o^e,\, o^c} \sum_{e} d(s_1^e, s_2^e) \cdot o^e - \sum_{c \in C} h(\{s_j \mid s_j \subset c\}) \cdot o^c, \tag{6}$$

where $d(s_1^e, s_2^e)$ is the dissimilarity between two superpixels, and
$h(\cdot)$ measures the consistency of grouping all superpixels covered by
an E-SVM mask into one region. Eq. (6) can be efficiently solved via the
cutting plane algorithm as introduced in [24].

The dissimilarity $d(s_1^e, s_2^e)$ in Eq. (6) is defined together with the
detected contours: $d(s_1^e, s_2^e) = 1$ if there exists any contour across
the area covered by $s_1^e$ and $s_2^e$, otherwise $d(s_1^e, s_2^e) = 0$.
$h(\{s_j \mid s_j \subset c\})$ is defined as the normalized total area of
the superpixels covered by the template $c$.
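To illustrate why the edge-based formulation removes the need to fix the number of regions, the following sketch recovers regions from a given set of binary edge decisions using a union-find structure. It does not solve Eq. (6); it assumes the edge variables have already been obtained, and the function name and input layout are hypothetical.

```python
def regions_from_edges(num_superpixels, edges, same_region):
    """Group superpixels into regions from binary edge decisions.

    edges: list of (a, b) index pairs of neighboring superpixels.
    same_region: list of 0/1 values, one per edge (the edge variables).
    Returns a list mapping each superpixel to a region id 0..K-1, where
    K is not specified in advance but follows from the edge decisions.
    """
    parent = list(range(num_superpixels))

    def find(a):
        # Follow parent links to the root, halving the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for (a, b), o_e in zip(edges, same_region):
        if o_e == 1:
            parent[find(a)] = find(b)  # merge the two groups

    roots = {}
    labels = []
    for s in range(num_superpixels):
        # Assign consecutive region ids in order of first appearance.
        labels.append(roots.setdefault(find(s), len(roots)))
    return labels

# Five superpixels in a chain; the edge between 1 and 2 is cut.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
labels = regions_from_edges(5, edges, [1, 0, 1, 1])
num_regions = len(set(labels))
print(labels, num_regions)
```

In this toy chain, cutting a single edge yields two regions without the region count ever being supplied as an input.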

ii. Training E-SVMs: The energy $E(w_k, r_k)$ introduced in Eq. (4) is the
convex energy function of the E-SVM, as follows:

$$E(w_k, r_k) = \frac{1}{2}\|w_k\|^2 + \lambda_1 \max(0, 1 - w_k^{T} f(r_k)) + \lambda_2 \sum_{r_n \in N_E} \max(0, 1 + w_k^{T} f(r_n)). \tag{7}$$

$N_E$ denotes the negative examples, and $f(\cdot)$ is the feature of a
region, following [23]. $\lambda_1$ and $\lambda_2$ are two regularization
parameters. Thus maximizing $P(W \mid R)$ is equivalent to minimizing the
energy in Eq. (7), i.e., training the parameters of the E-SVM classifiers by
gradient descent.
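As a rough illustration of the energy in Eq. (7), the sketch below evaluates it for one exemplar and a handful of negatives. It assumes the standard exemplar-SVM hinge form with unit margins and no bias term; the default weights 0.5 and 0.01 are the values reported in the parameter settings of Section 4, while the function names and the toy 2-D features are hypothetical.

```python
def hinge(z):
    """Hinge function max(0, z)."""
    return max(0.0, z)

def dot(w, f):
    """Inner product of two equal-length sequences."""
    return sum(wi * fi for wi, fi in zip(w, f))

def esvm_energy(w, exemplar, negatives, lam1=0.5, lam2=0.01):
    """Energy of one exemplar-SVM (assumed hinge form, no bias term).

    w: weight vector; exemplar: feature of the single positive region;
    negatives: list of features of negative patches.
    """
    reg = 0.5 * dot(w, w)                        # regularization term
    pos = lam1 * hinge(1.0 - dot(w, exemplar))   # loss on the exemplar
    neg = lam2 * sum(hinge(1.0 + dot(w, f)) for f in negatives)
    return reg + pos + neg

# Toy 2-D features (real HOG descriptors are much longer).
w = [1.0, 0.0]
exemplar = [2.0, 0.0]
negatives = [[-2.0, 0.0], [0.0, 1.0]]
energy = esvm_energy(w, exemplar, negatives)
print(energy)
```

Because the positive term is weighted far more heavily than each negative term, a single exemplar can still dominate the fit despite being outnumbered by negatives.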

We train an E-SVM classifier for each of the selected regions: each selected
region is considered as a positive example (exemplar), and a number of
patches outside the selected region are cropped as negative examples. In the
implementation, we use HOG as the feature for each region. The region
selection indicator $\phi(r_k)$ in Eq. (4) is determined by automated
saliency detection [25]. For computational efficiency, we only train E-SVMs
for high-confidence foreground regions, i.e., regions containing garment
items.

iii. Segmentation Propagation: We search for possible propagations by the
sliding window method. However, as each E-SVM is trained independently,
their responses may not be directly comparable. We thus perform calibration
by fitting a logistic distribution with parameters $\alpha_E$ and $\beta_E$
on the training set. Then the E-SVM response can be defined as

$$S_E(f; w) = \frac{1}{1 + \exp(-\alpha_E (w^{T} f - \beta_E))}, \tag{8}$$

where $f$ is the feature vector of the image patch covered by the sliding
window.
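The calibrated response of Eq. (8) is a logistic function of the raw E-SVM score. A minimal sketch, with hypothetical parameter values and a toy 2-D feature, is:

```python
import math

def calibrated_response(w, f, alpha, beta):
    """Logistic calibration of a raw E-SVM score w^T f, as in Eq. (8).

    alpha and beta are the per-classifier calibration parameters; the
    values used below are made up for illustration.
    """
    raw = sum(wi * fi for wi, fi in zip(w, f))
    return 1.0 / (1.0 + math.exp(-alpha * (raw - beta)))

w = [1.0, 1.0]
at_threshold = calibrated_response(w, [0.5, 0.5], alpha=2.0, beta=1.0)
above = calibrated_response(w, [1.0, 1.0], alpha=2.0, beta=1.0)
print(at_threshold, above)
```

A raw score equal to beta maps to 0.5, and alpha controls how sharply the response saturates, which puts detectors trained independently on a common scale.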

3.2. Contextualized Co-Labeling 

In this phase, each image is represented by a set of coherent regions, and
we assign a garment tag to each region by optimizing a multi-image graphical
model, i.e., an MRF connecting all the images in the database, which is
defined in Eq. (5). We define two types of edges on the graph: the interior
edges connecting neighboring regions within an image, and the exterior edges
connecting regions of different images matched by the propagated
segmentation. A toy example of the graphical model is shown in Figure 2.
Specifically, two types of clothing contexts are exploited to provide
informative constraints during inference.

Figure 2. We perform co-labeling by optimizing a multi-image graphical
model, i.e., an MRF connecting all the images in the database. A toy example
of the model is illustrated above, where the green solid lines are interior
edges between adjacent regions within the same images while the black dashed
lines are exterior edges across different images. Note that the connections
among different images are determined by the segmentation propagation.

The singleton potential $P(\ell_k, r_k)$ defined in Eq. (5) incorporates a
region appearance model with the garment item location context.

For each type of garment, we train the appearance model as an SVM classifier
based on local region appearance. Let $S(f(r_k), \ell_k)$ denote the score
of the appearance model, and $f(r_k)$ a 40-bin feature vector formed by
concatenating the color and gradient histograms. We define the potential of
assigning label $\ell_k$ to region $r_k$ as

$$P(\ell_k, r_k) = \mathrm{sig}(S(f(r_k), \ell_k)) \cdot G_{\ell_k}(X_k), \tag{9}$$

where $\mathrm{sig}(\cdot)$ indicates the sigmoid function, and $X_k$ the
center of region $r_k$. The location context $G_{\ell_k}(X_k)$ is defined
upon the 2-D Gaussian distribution as

$$G_{\ell_k}(X_k) \sim \mathcal{N}(\mu_{\ell_k}, \Sigma_{\ell_k}), \tag{10}$$

where $\mu_{\ell_k}$ and $\Sigma_{\ell_k}$ represent the mean and the
covariance of the location of garment item $\ell_k$, respectively, which can
be estimated over the training set.
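The singleton potential of Eq. (9) multiplies a squashed appearance score by a location prior. The sketch below assumes that the location context is evaluated as the density of the 2-D Gaussian in Eq. (10) at the region center, and that coordinates are normalized to the unit square; both are assumptions of this example, as are the function names and numbers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gaussian_2d(x, mean, cov):
    """Density of a 2-D Gaussian with a full 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form with the closed-form inverse of a 2x2 matrix.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def singleton_potential(svm_score, center, mean, cov):
    """Appearance score squashed by a sigmoid, times the location prior."""
    return sigmoid(svm_score) * gaussian_2d(center, mean, cov)

# Hypothetical location statistics of one garment type.
mean = (0.5, 0.5)
cov = ((0.04, 0.0), (0.0, 0.04))
at_mean = singleton_potential(0.0, (0.5, 0.5), mean, cov)
far_away = singleton_potential(0.0, (0.5, 0.9), mean, cov)
print(at_mean, far_away)
```

With the same appearance score, a region centered far from the typical location of a garment receives a much smaller potential than one at the mean.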

The interior affinity $P(\ell_m, \ell_n, r_m, r_n)$ in Eq. (5) of two
adjacent regions $r_m$ and $r_n$ is defined on two terms within an image,
their appearance compatibility and mutual interactions, as

$$P(\ell_m, \ell_n, r_m, r_n) = \Phi(\ell_m, \ell_n, r_m, r_n)\, \Psi(\ell_m, \ell_n). \tag{11}$$

The appearance compatibility function $\Phi(\ell_m, \ell_n, r_m, r_n)$
encourages regions with similar appearance to have the same tag:

$$\Phi(\ell_m, \ell_n, r_m, r_n) = \exp\{-\mathbb{1}(\ell_m = \ell_n)\, d(r_m, r_n)\}, \tag{12}$$

where $\mathbb{1}(\cdot)$ is the indicator function, and $d(r_m, r_n)$ is
the $\chi^2$-distance between the appearance features of the two regions.
$\Psi(\ell_m, \ell_n)$ models the mutual interactions of two different
garment items $\ell_m$ and $\ell_n$. This term is simple but effective,
since some garments are likely to appear as neighbors in an image, e.g.,
coat and pants, while others are not, e.g., hat and shoes. In practice, we
calculate $\Psi(\ell_m, \ell_n)$ by accumulating the frequency with which
they appear as neighbors over all adjacent image patches in the training
data.
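The two ingredients of the interior affinity can be sketched as follows. Several details are assumptions of this example rather than statements about the original implementation: the chi-square distance uses the common 1/2-normalized form, the co-occurrence counts are normalized to relative frequencies, pairs with identical labels are skipped, and all names and data are hypothetical.

```python
import math
from collections import Counter

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-square distance between two histograms (1/2-normalized form)."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def appearance_compatibility(label_m, label_n, feat_m, feat_n):
    """Penalize giving the same label to regions that look different."""
    same = 1.0 if label_m == label_n else 0.0
    return math.exp(-same * chi2_distance(feat_m, feat_n))

def mutual_interaction_table(adjacent_label_pairs):
    """Relative frequency of two different labels appearing as neighbors."""
    counts = Counter(
        frozenset(pair) for pair in adjacent_label_pairs if pair[0] != pair[1]
    )
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Hypothetical neighbor pairs collected from annotated training images.
pairs = [("coat", "pants"), ("pants", "coat"), ("coat", "pants"), ("hat", "coat")]
table = mutual_interaction_table(pairs)
coat_pants = table[frozenset(("coat", "pants"))]
hat_shoes = table.get(frozenset(("hat", "shoes")), 0.0)
print(coat_pants, hat_shoes)
```

Label pairs never observed as neighbors get zero frequency here; a practical system would likely smooth these counts so that unseen pairs are discouraged rather than forbidden.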


The exterior affinity $Q(\ell_u, \ell_v, r_u, r_v \mid C)$ of Eq. (5) across
different images imposes the constraint that regions in different images
sharing similar appearance and locations should have a high probability of
having the same garment tag. We thus have

$$Q(\ell_u, \ell_v, r_u, r_v \mid C) = G_{\ell_u}(X_u)\, G_{\ell_v}(X_v)\, \Phi(\ell_u, \ell_v, r_u, r_v), \tag{13}$$

in which the terms are defined in Eq. (10) and Eq. (12). Finally, we adopt
Graph Cuts to optimize the multi-image graphical model.


Algorithm 1 The Sketch of Clothing Co-Parsing

Input:

A set of clothing images $\mathbf{I} = \{I_i\}_{i=1}^{N}$ with tags
$\{T_i\}_{i=1}^{N}$.

Output:

The segmented regions R with their corresponding labels L.

Phase (I): Image Co-Segmentation

Repeat

1 For each image I, group its superpixels into regions R under the guidance
of the segmentation propagations C by maximizing P(R|C, I) in Eq. (3);

2 Train E-SVM parameters for each selected region by minimizing the energy
in Eq. (7);

3 Propagate segmentations across images by detections from the trained
E-SVM classifiers by Eq. (8).

Until the regions do not change during the last iteration

Phase (II): Contextualized Co-Labeling

1 Construct the multi-image graphical model;

2 Solve for the optimal label assignment L* by optimizing the probability
defined on the graphical model as in Eq. (5) by Graph Cuts.


4. Experiments 

We first introduce the clothing parsing datasets, and present the
quantitative results and comparisons. Some qualitative results are exhibited
as well.

Parameter settings: We use the gPb contour detector [1] to obtain
superpixels and contours, and the threshold of the detector is adapted to
obtain about 500 superpixels for each image. The contours that help define
$d(s_1^e, s_2^e)$ in Eq. (6) were obtained by setting the threshold to 0.2.
For training E-SVMs, we set $\lambda_1 = 0.5$ and $\lambda_2 = 0.01$ in
Eq. (7). The appearance model in Sec. 3.2 is trained by a multi-class SVM
using one-against-one decomposition with a Gaussian kernel.












4.1. Clothing Image Datasets 

We evaluate our framework on two datasets: Clothing Co-Parsing² (CCP) and
Fashionista [30]. CCP is created by us and consists of 2,098 high-resolution
fashion photos with huge human/clothing variations, covering a wide range of
styles, accessories, garments, and poses. More than 1000 images of CCP have
superpixel-level annotations with 57 tags in total, and the rest of the
images are annotated with image-level tags. All annotations are produced by
a professional team. Some examples of CCP are shown in Figure 4. Fashionista
contains 158,235 fashion photos from fashion blogs, which are further
separated into an annotated subset containing 685 images with
superpixel-level ground truth, and an unannotated subset associated with
possibly noisy and incomplete tags provided by the bloggers. The annotated
subset of Fashionista contains 56 labels, and some garments with high
occurrences in the dataset are shown in Figure 3.

4.2. Quantitative Evaluation 

To evaluate the effectiveness of our framework, we compare our method with
three state-of-the-art methods: (1) PECS [30], which is a fully supervised
clothing parsing algorithm that combines pose estimation and clothing
segmentation, (2) the bi-layer sparse coding (BSC) [18] for uncovering the
label for each image region, and (3) the semantic texton forest (STF) [26],
a standard pipeline for semantic labeling.

The experiments are conducted on both the Fashionista and CCP datasets.
Following the protocol in [30], all measurements use 10-fold cross
validation, with 9 folds used for training as well as for tuning free
parameters, and the remaining fold for testing. The performance is measured
by average Pixel Accuracy (aPA) and mean Average Garment Recall (mAGR), as
in [30]. As background is the most frequent label appearing in the datasets,
simply assigning all regions to be background achieves 77.63% / 77.60%
accuracy and 9.03% / 15.07% mAGR on the Fashionista and the CCP datasets,
respectively. We treat them as the baseline results.
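The gap between the two baseline numbers follows from how the metrics weigh labels. The sketch below uses simplified, single-image versions of pixel accuracy and per-label recall on a made-up label sequence; it does not reproduce the exact averaging protocol of [30], and the function names and data are hypothetical.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    correct = sum(p == t for p, t in zip(pred, truth))
    return correct / len(truth)

def mean_label_recall(pred, truth):
    """Recall computed per ground-truth label, then averaged over labels."""
    recalls = []
    for label in sorted(set(truth)):
        idx = [i for i, t in enumerate(truth) if t == label]
        hits = sum(pred[i] == label for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# Toy image flattened to 10 pixels: mostly background, two small garments.
truth = ["bg"] * 6 + ["coat"] * 3 + ["hat"]
all_background = ["bg"] * len(truth)

acc = pixel_accuracy(all_background, truth)
rec = mean_label_recall(all_background, truth)
print(acc, rec)
```

Labeling everything as background is rewarded by pixel accuracy in proportion to the background area, but recall averaged over labels stays low because every garment label is missed entirely.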

Table 1 reports the clothing parsing performance of each method on the
Fashionista and CCP datasets. On both datasets, our method clearly
outperforms the BSC and STF methods, as they do not exploit clothing-specific
knowledge. We also outperform the state-of-the-art clothing parsing system
PECS on both datasets. As images of the CCP database include more complex
backgrounds and clothing styles, the advantage of our approach is better
demonstrated there. In fact, the process of iterative image co-segmentation
effectively suppresses the image clutter and generates coherent regions, and
the co-labeling phase better handles the variants of clothing styles by
incorporating rich priors and contexts. In addition, we report the average
recall of several frequently occurring garment items in the Fashionista
dataset in Figure 3.

² http://vision.sysu.edu.cn/projects/clothing-co-parsing/

                 Fashionista           CCP
Methods          aPA      mAGR         aPA      mAGR
Ours-full        90.29    65.52        88.23    63.89
PECS [30]        89.00    64.37        85.97    51.25
BSC [18]         82.34    33.63        81.61    38.75
STF [26]         68.02    43.62        66.85    40.70
Ours-1           89.69    61.26        87.12    61.22
Ours-2           88.55    61.13        86.75    59.80
Ours-3           84.44    47.16        85.43    42.50
Baseline         77.63     9.03        77.60    15.07

Table 1. Clothing parsing results (%) on the Fashionista and the CCP
datasets. As background is the most frequent label appearing in the
datasets, assigning all regions to be background is adopted as our baseline
comparison. We compare our full system (Ours-full) to three state-of-the-art
methods including PECS [30], BSC [18], and STF [26]. We also present an
empirical analysis to evaluate the effectiveness of the main components of
our system. Ours-1 and Ours-2 evaluate the effectiveness of the co-labeling
phase by only employing the exterior affinity, and by only using the
interior affinity, respectively. Ours-3 evaluates the performance of
superpixel grouping in the co-segmentation phase.

Figure 3. Average recall of some garment items with high occurrences in
Fashionista.

Evaluation of Components. We also present an empirical analysis to
demonstrate the effectiveness of the main components of our system. Ours-1
and Ours-2 in Table 1 evaluate the effectiveness of the co-labeling phase by
only employing the exterior affinity, and by only using the interior
affinity, respectively. Ours-3 evaluates the performance of superpixel
grouping in the co-segmentation phase. Ours-1 achieves the best result among
Ours-1, Ours-2, and Ours-3, which we attribute to the importance of mutual
interactions between garment items; performing co-labeling on a multi-image
graphical model thus benefits the clothing parsing problem.

4.3. Qualitative Evaluation 

Figure 4 illustrates some successful parsing results for exemplary images
from both Fashionista and CCP. Our method is able to parse clothes
accurately even under some challenging illumination and complex background
conditions (r1c2³, r4c2). Moreover, our method can even parse some small
garments such as belt (r1c1, r2c1, r2c2, r3c2), purse (r1c3), hat (r1c4,
r2c3), and sunglasses (r4c2). For reasonably ambiguous clothing patterns
such as a dotted t-shirt or a colorful dress, our framework gives satisfying
results (r2c4, r5c2). In addition, the proposed method can even parse
several persons in a single image simultaneously (r5c5).

Figure 4. Some successful parsing results on (a) Fashionista and (b) CCP.

Some failure cases are shown in Figure 5. Our co-parsing framework may
produce wrong results under the following scenarios: (a) ambiguous patterns
exist within a clothing garment item; (b) different clothing garment items
share similar appearance; (c) the background is extremely disordered;
(d) the illumination condition is poor.

4.4. Efficiency 

All the experiments are carried out on a PC with an Intel Dual-Core E6500
(2.93 GHz) CPU and 8GB RAM. The run-time complexity of the co-segmentation
phase scales linearly with the number of iterations, and each iteration
costs about 10 sec per image. The co-labeling phase takes less than 1 minute
to assign labels to a database of 70 images, which is very efficient thanks
to the consistent regions obtained from the co-segmentation phase. The Graph
Cuts algorithm converges in 3-4 iterations in our experiments.

³ We use "r1c1" to denote the image in row 1, column 1.

5. Conclusions 

This paper has proposed a well-engineered framework for jointly parsing a
batch of clothing images given the image-level clothing tags. Another
contribution of this paper is a dataset of high-resolution street fashion
photos with annotations. The experiments demonstrate that our framework is
effective and applicable compared with the state-of-the-art methods. In
future work, we plan to improve the inference by iterating the two phases to
bootstrap the results. In addition, a parallel implementation will be
studied to adapt the framework to large-scale applications.